In [305]:
import pandas as pd
import numpy as np
%matplotlib inline
pandas = 😍
Since y'all love math, we're going to do plenty of both today.
One of the most basic ways to split statistics is to break it into two categories: descriptive and inferential.
We're going to be doing some basic descriptive statistics, because we sure aren't going to release our entire dataset to our readers. Summing it all up into a few numbers works much more nicely.
The two major categories of data are qualitative/categorical and quantitative/numerical. I'll use both words to describe each because I'm incapable of picking a term and sticking to it.
And, lucky us, there are different kinds of numeric data, too!
You use descriptive statistics all the time! Averages! Maximums! Minimums! These old friends are your new friends, too.
We can break down descriptive statistics into a few major concepts; we'll talk about central tendency and variability. Let's take a look at those with a really dumb sample data set.
In [306]:
# Let's build a data set
df = pd.DataFrame([
{ 'name': 'Smushface', 'salary': 0 },
{ 'name': 'Jen', 'salary': 25000 },
{ 'name': 'James', 'salary': 55000 },
{ 'name': 'John', 'salary': 35000 },
{ 'name': 'Josephine', 'salary': 25000 },
{ 'name': 'Jacques', 'salary': 80000 },
{ 'name': 'Bill Gates', 'salary': 20000000 }
])
df
Out[306]:
If someone hears we have this data set about salaries, they're probably going to ask, "how much do people make?" They don't want a long list of numbers, they want a single, solitary number. We can get most of the way there by describing the central tendency.
Data in the world tends to clump around certain numbers - the average height of a man, or the average score on a test. This is called the central tendency, and is usually just called the average. Luckily for us average has like two hundred different meanings: mean, median, and mode.
Double-luckily for us, pandas can compute all of those for us with appropriately-named functions.
In [307]:
(0 + 25000 + 55000 + 35000 + 25000 + 80000 + 20000000) / 7
Out[307]:
But like I said, pandas can help us out here with the .mean() method.
In [308]:
df['salary'].mean()
Out[308]:
That looks ugly, let's convert it to an integer!
In [309]:
df['salary'].mean().astype(int)
Out[309]:
If we want to get real crazy, we can add commas to it using our old friend .format()
In [310]:
# ***format trick stolen from http://stackoverflow.com/a/10742904
mean_salary = df['salary'].mean().astype(int)
"{:,}".format(mean_salary)
Out[310]:
But back to the mean: apparently the average of all of these salaries is over two million dollars. Does that look right to you?
In [311]:
df
Out[311]:
The problem with adding everything together is Bill Gates is exerting undue influence. His salary is an outlier - a number that's either way too high or way too low and kind of screws up our data. He might actually be making that much money, sure, but by taking the mean we aren't doing a good job describing what we'd think of as the "average."
Because of how it's calculated, the mean is susceptible to outliers. Because you need to be so careful with it, the mean is definitely not my favorite way of getting the average.
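To see how much one value can drag the mean around, here's a minimal sketch that reuses the numbers from the hand calculation above. The variable names are just for illustration, and "drop the biggest value" is a crude stand-in for real outlier handling.

```python
import pandas as pd

# Salaries taken from the hand calculation above
salaries = pd.Series([0, 25000, 55000, 35000, 25000, 80000, 20000000])

# Mean of everything, outlier included
mean_with_outlier = salaries.mean()

# Drop the single largest value (the outlier) and recompute
mean_without_outlier = salaries[salaries < salaries.max()].mean()

print("Mean with the outlier:   ", round(mean_with_outlier))
print("Mean without the outlier:", round(mean_without_outlier))
```

One person out of seven moves the mean from tens of thousands to millions, which is the whole problem.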
In [312]:
df['salary'].sort_values()
Out[312]:
We have seven values, so the median will be number four. Count up the list to discover it: 35,000 is the median. I'll prove it, too, using the power of pandas.
In [313]:
df['salary'].median()
Out[313]:
See? Told you!
If you happen to have an even number of data points, you won't have a middle number. Instead, you take the mean of the middle two numbers.
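Here's a quick sketch of the even-numbered case, using six made-up values. It grabs the two middle numbers by hand and checks the result against .median().

```python
import pandas as pd

# Six made-up values, so there is no single middle number
even_values = pd.Series([10, 20, 30, 40, 50, 60])

# After sorting, the two middle values sit at positions 2 and 3
middle_two = even_values.sort_values().iloc[2:4]
manual_median = middle_two.mean()

print("Middle two values:", list(middle_two))
print("Their mean:", manual_median)
print("pandas median:", even_values.median())
```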
My favorite description of the median comes from Statistics for the Terrified:
We are all much more familiar with the mean - why? People like using the mean because it is a much easier thing to deal with than the median, mathematically, particularly in more complex situations... Always use the median when the distribution is skewed. You can use either the mean or the median when the population is symmetrical, because then they will give almost identical results.
Which to me reads like "if you have a computer, use the median."
The mode is the least-used measurement of central tendency: it's the most popular value. Even though our salary dataset has a most popular value, the mode actually shouldn't be used with continuous data; you should only use it with discrete data.
Let's say our buddies are reviewing a restaurant:
In [314]:
import pandas as pd
# Let's build a data set
reviews_df = pd.DataFrame([
{ 'restaurant': 'Burger King', 'reviewer': 'Smushface', 'yelp_stars': 2 } ,
{ 'restaurant': 'Burger King', 'reviewer': 'Jen', 'yelp_stars': 2 },
{ 'restaurant': 'Burger King', 'reviewer': 'James', 'yelp_stars': 5 },
{ 'restaurant': 'Burger King', 'reviewer': 'John', 'yelp_stars': 4 },
{ 'restaurant': 'Burger King', 'reviewer': 'Josephine', 'yelp_stars': 4 },
{ 'restaurant': 'Burger King', 'reviewer': 'Jacques', 'yelp_stars': 3 },
{ 'restaurant': 'Burger King', 'reviewer': 'Bill Gates', 'yelp_stars': 2 }
])
reviews_df
Out[314]:
In [315]:
reviews_df['yelp_stars'].mode()
Out[315]:
Despite the fact that most people gave Burger King a 3 or above, the fact that the most popular score is 2 might mean something.
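If you want to see where the mode comes from, .value_counts() tallies every score. This is a small sketch using the same star ratings as above. Note that .mode() hands back a Series rather than a single number, since two values can tie for most popular.

```python
import pandas as pd

# The yelp_stars values from the reviews above
stars = pd.Series([2, 2, 5, 4, 4, 3, 2])

# value_counts() shows how many times each score appears
counts = stars.value_counts()
print(counts)

# mode() returns a Series because there can be ties; take the first entry
modes = stars.mode()
most_common = modes.iloc[0]
print("Most common score:", most_common)

# Count the reviewers who gave 3 stars or more
three_or_more = (stars >= 3).sum()
print("Scores of 3 or above:", three_or_more)
```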
My favorite example of the mode being useful (and possibly only example of the mode being useful) is Amazon reviews. For example, this charger for a MacBook has some... interesting reviews.
In [316]:
# Import this because it won't let us display images otherwise
from IPython.display import display, HTML
In [317]:
display(HTML('''<img src="">'''))
Look at that adapter!
2.5 stars, not too shabby. And so cheap! The real ones are like 80 bucks, I think.
Let's take a look at the actual distribution of the scores...
In [318]:
display(HTML('''<img src="">'''))
Oh wait, the mode of the data is 1. The adapters are probably terrible, cancel that order.
There are three measures of central tendency.
The median should probably be your favorite.
In [319]:
df.head()
Out[319]:
In [320]:
# Make it so we only have three decimal places
pd.set_option('display.float_format', lambda x: '%.3f' % x)
df.describe()
Out[320]:
And you see the mean displayed in all its glory, and so you scream, "It doesn't include the median! What garbage!!!"
But it does! Really! Give it a look around and see if you can find it.
waiting!
waiting!
waiting!
Yes, that's right - 50% is the median. Half of the values are above, half are below. The 25% and 75% are similar measures:
25% can be thought of as the median of the bottom half of the data, and 75% can describe the median of the top half of the data. They give you a sense of the range of data.
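You can also ask for these numbers directly with .quantile(). This is a sketch on made-up data. One caveat: pandas interpolates between data points by default, so its 25% and 75% can differ a little from the "median of each half" shortcut.

```python
import pandas as pd

# Nine made-up values, already in order
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = values.quantile(0.25)   # the 25% row of .describe()
q2 = values.quantile(0.50)   # the 50% row, a.k.a. the median
q3 = values.quantile(0.75)   # the 75% row of .describe()

print("Q1:", q1, "Median:", q2, "Q3:", q3)

# .describe() reports the same three numbers
summary = values.describe()
print(summary[['25%', '50%', '75%']])
```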
If we get tired of looking at lists of numbers, there's always box-and-whisker plots. They're the visual version of .describe() - they describe the minimum, Q1, median, Q3, and maximum.
Well, usually.
In [321]:
df.boxplot(column='salary', sym='o', return_type='axes')
Out[321]:
The box part is 25%-75%, with the red line being the median. The whiskers are usually the maximum and minimum, but matplotlib/pandas likes to display it as "oh this is where nice values live" (a.k.a. 1.5 times the IQR, the distance between 25% and 75%). We can make it do max and min by passing whis='range' when we make the box plot (newer versions of matplotlib want whis=(0, 100) instead).
In [322]:
df.boxplot(column='salary', sym='o', whis='range', return_type='axes')
Out[322]:
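The "where nice values live" rule can be worked out by hand. This sketch uses the salary numbers from the hand calculation earlier and computes the usual 1.5 x IQR cutoffs; the fence names are mine, not anything pandas or matplotlib exposes. The default whiskers stop at the most extreme data point inside those cutoffs, and anything outside gets drawn as an individual point.

```python
import pandas as pd

# Salaries from the hand calculation earlier
salaries = pd.Series([0, 25000, 55000, 35000, 25000, 80000, 20000000])

q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1   # interquartile range: the height of the box

# Cutoffs 1.5 IQRs below Q1 and above Q3
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the cutoffs would be plotted as an outlier point
outliers = salaries[(salaries < lower_fence) | (salaries > upper_fence)]

print("IQR:", iqr)
print("Cutoffs:", lower_fence, "to", upper_fence)
print("Outliers:", list(outliers))
```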
It might look a little bit better if we pull in an actual data set. Let's use those billionaires we worked on before.
In [323]:
rich_df = pd.read_excel("rich_people.xlsx")
rich_df = rich_df[rich_df['year'] == 2014]
rich_df.head(2)
Out[323]:
In [324]:
rich_df = rich_df.dropna(subset=['age', 'networthusbillion', 'foundingdate'])
rich_df.describe()
Out[324]:
In [325]:
rich_df.boxplot(column='networthusbillion', whis='range', return_type='axes')
Out[325]:
In [326]:
rich_df.boxplot(column='age', whis='range', return_type='axes')
Out[326]:
I'm going to steal a set of numbers from Khan Academy. Let's say we have two very boring sets of numbers.
In [327]:
list_one = pd.Series([-10, 0, 10, 20, 30])
list_two = pd.Series([8, 9, 10, 11, 12])
print("list one is")
print(list_one)
print("list two is")
print(list_two)
Let's use their central tendencies to describe them.
In [328]:
print("The mean of list_one is", list_one.mean())
print("The mean of list_two is", list_two.mean())
print("The median of list_one is", list_one.median())
print("The median of list_two is", list_two.median())
Huh! But I mean, let's be honest: THESE LISTS OF NUMBERS ARE VERY DIFFERENT. If their central tendencies are the same, the way to describe them, then, is to talk about the spread, or how the actual numbers themselves are distributed.
So we learned about the range before: it's the difference between the smallest and largest number.
For [-10, 0, 10, 20, 30], the range is 40. It's much more dispersed.
For [8, 9, 10, 11, 12], the range is 4. It's much tighter.
That's helpful! But there are more ways to measure the spread than just range.
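pandas doesn't need a special function for the range, since .max() minus .min() gets you there. Here's a minimal sketch; value_range is a made-up helper name, not a pandas method.

```python
import pandas as pd

list_one = pd.Series([-10, 0, 10, 20, 30])
list_two = pd.Series([8, 9, 10, 11, 12])

def value_range(series):
    """Difference between the largest and smallest value."""
    return series.max() - series.min()

range_one = value_range(list_one)
range_two = value_range(list_two)

print("Range of list_one:", range_one)
print("Range of list_two:", range_two)
```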
Along with range, there are two other things we need to learn about how these numbers are distributed: variance and standard deviation.
In [329]:
# Data points [-10, 0, 10, 20, 30]
# Mean: 10
((-10 - 10)**2 + (0 - 10)**2 + (10 - 10)**2 + (20 - 10)**2 + (30 - 10)**2) / 5
Out[329]:
In [330]:
# Data points [8, 9, 10, 11, 12]
# Mean: 10
((8 - 10)**2 + (9 - 10)**2 + (10 - 10)**2 + (11 - 10)**2 + (12 - 10)**2) / 5
Out[330]:
In [331]:
# And pandas agrees
# ddof=0 divides by n (population variance), matching the math above
# The pandas default, ddof=1, divides by n - 1 instead (sample variance)
print(list_one.var(ddof=0))
print(list_two.var(ddof=0))
So first, the first data set has a much higher variance than the second data set.
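The hand calculation generalizes pretty easily. This sketch writes the same "average squared distance from the mean" as a function (population_variance is a hypothetical name) and compares it against .var() with both ddof settings. With ddof=1, the pandas default, you divide by n - 1 instead of n, so the number comes out a bit bigger.

```python
import pandas as pd

list_one = pd.Series([-10, 0, 10, 20, 30])

def population_variance(series):
    """Average squared distance from the mean, dividing by n."""
    squared_diffs = (series - series.mean()) ** 2
    return squared_diffs.sum() / len(series)

by_hand = population_variance(list_one)
pandas_population = list_one.var(ddof=0)   # divides by n
pandas_sample = list_one.var()             # default ddof=1, divides by n - 1

print("By hand:      ", by_hand)
print("pandas ddof=0:", pandas_population)
print("pandas ddof=1:", pandas_sample)
```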
In [332]:
import math
In [333]:
# Data points [-10, 0, 10, 20, 30]
# Variance: 200
math.sqrt(200)
Out[333]:
In [334]:
# Data points [8, 9, 10, 11, 12]
# Variance: 2.0
math.sqrt(2)
Out[334]:
The guy on Khan Academy is like "Yeah! The first data set has ten times the standard deviation of the second data set!" which he is really excited about. Since the standard deviation is 10x larger, think about it as "generally, a data point in the first data set is 10x further from the mean than in the second data set."
Range is easy. Variance and standard deviation are a little tougher - think of them as measurements of how far away from the mean your data generally is. High variance/standard deviation = numbers are generally spread out. Small variance/standard deviation = numbers are generally closer to the mean.
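Standard deviation is just the square root of the variance, and pandas has .std() for it. This sketch checks that against math.sqrt and confirms the 10x relationship. It uses ddof=0 to match the hand math above; leave ddof off and pandas uses the n - 1 version.

```python
import math
import pandas as pd

list_one = pd.Series([-10, 0, 10, 20, 30])
list_two = pd.Series([8, 9, 10, 11, 12])

# Population standard deviation: square root of the ddof=0 variance
std_one = list_one.std(ddof=0)
std_two = list_two.std(ddof=0)

print("Standard deviation of list_one:", std_one)
print("Standard deviation of list_two:", std_two)

# How many times more spread out is the first list?
ratio = std_one / std_two
print("Ratio:", ratio)
```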
In [335]:
rich_df.sort_values(by='age', ascending=False).head(3)
Out[335]:
How strange is it that those old rich people are so old? We can see how many standard deviations they are away from the mean.
In [336]:
rich_df['age_std'] = ((rich_df['age'] - rich_df['age'].mean()).apply(abs) / rich_df['age'].std())
In [337]:
rich_df.sort_values(by='age', ascending=False).head(3)
Out[337]:
They're all less than three standard deviations away from the mean. Generally, 3.0 is considered a crazy outlier. 1.5 is considered maybe an outlier, but probably not really. So no one's looking crazy old here.
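Since we keep typing the same formula, it can be wrapped in a function. This is only a sketch: std_distance is a made-up name and the ages are invented for the example. It uses the default .std(), same as the cell above.

```python
import pandas as pd

def std_distance(series):
    """How many standard deviations each value sits from the mean."""
    return (series - series.mean()).abs() / series.std()

# Made-up ages, just to show the shape of the output
ages = pd.Series([40, 50, 60, 70, 80])
distances = std_distance(ages)

print(distances)

# Flag anything 3 or more standard deviations out
print("Crazy outliers:", list(ages[distances >= 3]))
```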
What about in terms of wealth?
In [338]:
rich_df['age'].plot(kind='box')
Out[338]:
In [339]:
rich_df['wealth_std'] = ((rich_df['networthusbillion'] - rich_df['networthusbillion'].mean()).apply(abs) / rich_df['networthusbillion'].std())
In [340]:
rich_df.sort_values(by='wealth_std', ascending=False).head(3)
Out[340]:
Hey, look at that! They're crazy wealthy! It's OUT OF CONTROL! THEY ARE SO WEALTHY. UNBELIEVABLE!!!!
We could also know that by looking at a simple histogram, of course.
In [341]:
rich_df['networthusbillion'].hist()
Out[341]:
In [342]:
rich_df['networthusbillion'].hist(bins=25)
Out[342]:
In [343]:
rich_df['networthusbillion'].hist(bins=50)
Out[343]:
In [344]:
rich_df['networthusbillion'].describe()
Out[344]:
Even though we have a ton of billionaires, basically everyone is a baby billionaire with barely billions of dollars. This is skewed data. Compare it with a histogram of age.
In [345]:
rich_df['networthusbillion'].hist(bins=25)
Out[345]:
In [346]:
rich_df['age'].hist(bins=25)
Out[346]:
Mostly billionaires are around 60, but sometimes they're younger or older. This is something you can use simple summary statistics on. The net worth data is skewed, and attempting to use your normal boring statistics on it is a terrible mistake.
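One quick way to spot skew without a histogram is to compare the mean and the median, or to ask pandas for .skew(). The two series below are invented for illustration: one has a long right tail, the other is balanced around its middle.

```python
import pandas as pd

# Made-up data with a long right tail, like the net worth column
skewed = pd.Series([1, 1, 1, 2, 2, 3, 4, 5, 10, 50])

# Made-up data balanced around 60, like the age column
symmetric = pd.Series([40, 50, 55, 60, 60, 65, 70, 80])

for name, series in [("skewed", skewed), ("symmetric", symmetric)]:
    print(name)
    print("  mean:  ", series.mean())
    print("  median:", series.median())
    # Near zero means balanced; a large positive value means a long right tail
    print("  skew:  ", series.skew())
```

When the mean sits far above the median, the big values on the right are pulling it, and the median is the safer summary.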
The age data looks like what's called a normal distribution. It's nice, it's pleasant, it's normal. You can do normal things with it, like look for outliers. Let's read in some NBA data and look for outliers.
In [347]:
nba_df = pd.read_csv("nba.csv")
In [348]:
nba_df['WT'].describe()
Out[348]:
In [349]:
nba_df['WT'].hist()
Out[349]:
That mostly looks clustered around one value. But what's that weird one? Let's grab a standard deviation for every point.
In [350]:
nba_df['wt_std'] = ((nba_df['WT'] - nba_df['WT'].mean()).apply(abs) / nba_df['WT'].std())
nba_df.sort_values(by='wt_std', ascending=False).head(5)
Out[350]:
Jermaine Taylor has a weight that's 7 standard deviations away, which means we should probably look at it.
Oh look, he weighs 20 pounds. Does he actually weigh 20 pounds? We could do some research, but I'm thinking he doesn't.
How about we get rid of everyone that's a bad outlier? We only have one guy so far, but we might as well.
In [351]:
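# Only keep people who are less than three standard deviations from the mean
# .copy() gives us an independent DataFrame, so adding columns later won't trigger warnings
cleaned_nba_df = nba_df[nba_df['wt_std'] < 3].copy()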
In [352]:
cleaned_nba_df.sort_values(by='wt_std', ascending=False).head(3)
Out[352]:
Now that we got rid of some people, we can also recalculate the standard deviation! Remember, standard deviation is a relationship to the mean, and outliers move the mean.
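Here's the same idea on made-up numbers, so you can watch the standard deviation shrink once the bad value is gone. The weights are invented (twenty plausible values plus one obvious typo) and std_distance is a hypothetical helper, not part of pandas.

```python
import pandas as pd

def std_distance(series):
    """How many standard deviations each value sits from the mean."""
    return (series - series.mean()).abs() / series.std()

# Made-up weights: twenty plausible values plus one obvious data-entry error
weights = pd.Series(list(range(200, 240, 2)) + [20])

# First pass: the bad value inflates the standard deviation and drags the mean down
first_pass = std_distance(weights)
cleaned = weights[first_pass < 3]

# Second pass: recompute using the cleaned data's own mean and standard deviation
second_pass = std_distance(cleaned)

print("Std before cleaning:", weights.std())
print("Std after cleaning: ", cleaned.std())
print("Largest distance after cleaning:", second_pass.max())
```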
In [353]:
cleaned_nba_df['new_wt_std'] = ((cleaned_nba_df['WT'] - cleaned_nba_df['WT'].mean()).apply(abs) / cleaned_nba_df['WT'].std())
cleaned_nba_df.sort_values(by='new_wt_std', ascending=False).head(5)
Out[353]:
So now it's a little bit crazier to be so far from the mean, but generally things are all legitimate and pleasant.
In [354]:
# Histogram of weights with uncleaned data
nba_df['WT'].hist()
Out[354]:
In [355]:
# Histogram of weights with cleaned data
cleaned_nba_df['WT'].hist()
Out[355]:
In [356]:
# Box-and-whisker plot of weights with uncleaned data
nba_df['WT'].plot(kind='box', whis='range')
Out[356]:
In [357]:
# Box-and-whisker plot of weights with cleaned data
cleaned_nba_df['WT'].plot(kind='box', whis='range')
Out[357]:
See how pleasant that is? So pleasant.